ABSTRACT






Frequency transposition is the process of raising or 
lowering the frequency content (pitch) of an audio signal. 
The hearing impaired community has the greatest interest in 
the applications of frequency transposition. Though several 
analog and digital frequency transposing hearing aid systems 
have been built and tested, this thesis investigates a 
possible digital processing alternative. Pole shifting, in 
the z-domain, of an autoregressive (all pole) model of
speech was proven to be a viable theory for changing
frequency content. Since linear predictive coding (LPC)
techniques are used to code, analyze and synthesize speech,
with the resulting LPC coefficients related to the 
coefficients of an equivalent autoregressive model, a linear 
relationship between LPC coefficients and frequency 
transposition is explored. This theoretical relationship is 
first established using a pure sine wave and then is 
extended into processing speech. The resulting speech 
synthesis experiments failed to substantiate the conjectures 
of this thesis. However, future research avenues are 
suggested that may lead toward a viable approach to 
transpose speech. 
I. INTRODUCTION

A. BACKGROUND

Adjusting the frequency content or pitch of a signal 
is a topic researched within the audio field. The hearing 
impaired community has the greatest interest in the 
applications of frequency modification or transposition 
techniques. This is due to their need for auditory speech- 
processing aids. 

Auditory speech-processing aids are divided into two
groups: those which involve nonradical processing of the
speech signal, with the speech still intelligible to a
person with normal hearing, and those which involve radical
re-coding of the speech signal [Ref. 1:pp. 547-557].

An example of radical recoding involves such systems as
cochlear implants, where the normal speech signal is
processed into a series of vibrations that the brain
interprets as sound. Individuals who have this type of aid
surgically inserted in their cochlea must learn a
completely different language than a person with normal
hearing. Examples of nonradical processing aids include the
most widely used amplifier aids and the less familiar
frequency lowering devices or frequency transposition
systems.

Most hearing aids amplify sound. Some aids may amplify
or soften certain frequencies, while others transmit sound 
from the aid on one ear to the aid on the other ear. Their 
primary purpose, in either case, is to amplify everything 
they are capable of sensing. In this thesis, however, we 
are interested in developing an algorithm that may someday 
drive an aid which lowers the frequency content and 
preserves the intelligibility of the speech signal. 

B. FREQUENCY MODIFICATION

Pickett [Ref. 2:pp. 191-194] categorizes two basic
methods that have been used for frequency lowering:

1. Frequency transposition, where a portion of the 
signal is separated out and resynthesized in a lower 
frequency band.

2. Frequency division, where the frequency of the 
signal is reduced by a fixed ratio. 

All of the methods involve signal distortion. Signal 
distortion tends to increase with greater frequency shifts. 
Here we are concerned primarily with the idea of moderate 
frequency transposition, where the signal is shifted without 
major distortions in the information content. 

The earliest known suggestion of frequency lowering was 
by Perwitschky (1925). The earliest transposing hearing aid 
was built and tested by Johansson (1955). Since then, several
other systems have been built and tested, but considering
the advances and trends of current technology, research in
the area of frequency transposition of speech has not been
productive.

Frequency transposition systems have utilized analog
techniques such as frequency modulation (shifting an upper
band to a lower band) and frequency division (a slow playback
of a tape recorded signal), and digital techniques such as
sampling distortion (omitting segments of recorded speech)
and doppler (the delaying of the incoming signal). Though
these methods have been developed and extensively tested,
the digital approach presented here may produce altogether
different results.

Pickett confirms that the possibilities for usable 
frequency shifting algorithms have not been explored 
extensively enough to make recommendations for practice 
[Ref. 2:p. 193]. The research needs in this area include
obtaining new information on the potential for digital re- 
coding, exploring the principles of transposition, finding 
which general cues can be sent in this way, finding the 
optimum parameters, and examining what system can be built 
that meets our general and specific needs. 

C. A NEW TECHNIQUE FOR FREQUENCY TRANSPOSITION 

Recently, Hall [Ref. 3:p. 56] postulated that pole
shifting in the z-domain using an autoregressive (all pole)
model of speech may be a possible option for frequency
lowering. He used linear predictive coding (LPC) techniques
to process the speech to determine if pole shifting was a
viable option. His experimental results were positive
because he was able to create a change in pitch on the input
speech segment.

This thesis is an extension of Hall's research. It 
ventures beyond the frequency domain model, and works 
directly with the linear predictive time domain model. It 
was postulated that a linear relationship exists between 
frequency content and the reflection coefficients determined 
using LPC. Once this theory had been postulated, a speech
processing experiment was undertaken to determine if the
conjectures made were plausible.

In this report linear prediction is introduced, the
particular algorithms used to process the data are
explained, and the experimental research is described.
Identical phrases of speech, spoken at different pitch 
levels by the same speaker, are sampled and processed. 
Possible patterns existing between the different pitch 
segments of speech and their linear predictive coefficients 
are analyzed. 

The results of this research indicate that no linear
relationship exists between the frequency content of speech
and the LPC reflection coefficients. Recommendations are
made for continued analysis concerning linear predictive
coding and the frequency transposition of speech.

II. MODELING SPEECH PRODUCTION

A. INTRODUCTION

In order to understand speech reproduction and synthesis,
it is useful to consider some of the basic elements that
combine to produce speech. The most elementary model used to
explain the production of speech is the human model
illustrated below as Figure 1.




Human Speech Production System [Ref. 4:p. 42].
Figure 1.

The lungs produce the air flow necessary to begin the 
generation of sound. The vocal cords, tongue, mouth, lips 
and nasal tract combine their different properties to shape 
the airflow to produce the speech waveform we hear. 

B. THE SPEECH PRODUCTION MODEL

Evans [Ref. 4:pp. 40-45] relates the several human
functions to mechanical models. This is standard practice
and a widely accepted approach to speech production
modeling. He states that the lungs are the excitation 
source for the vocal and nasal tract areas. An excitation 
source can either be modeled as a pulse train generator or a 
random number generator when reproducing speech. 

In the case of voiced sounds (i.e., consonants, vowels or
nasal sounds), the air released by the lungs is periodically
modulated by vibrations from the vocal cords, glottis, and
velum. Thus the excitation model in this case is a pulse
generator. In the case of unvoiced sounds (e.g., sh, sss,
fff), which require no vibrations to be produced, the modeled
excitation source is a random number generator.
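
The two excitation sources described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `make_excitation`, the default pitch period of 80 samples, and the uniform distribution of the random samples are hypothetical choices, not values taken from Evans.

```python
import random

def make_excitation(num_samples, voiced, pitch_period=80, seed=0):
    """Return an excitation sequence for one stretch of speech.

    voiced=True  -> unit pulses every `pitch_period` samples (pulse generator)
    voiced=False -> zero-mean pseudo-random samples (random number generator)
    The defaults are illustrative assumptions, not values from the text.
    """
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0
                for n in range(num_samples)]
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]

voiced_excitation = make_excitation(200, voiced=True)
unvoiced_excitation = make_excitation(200, voiced=False)
```

A fixed pitch period is used here only for simplicity; in the model, the pitch period input changes from one stretch of voiced speech to the next.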

Both excitation sources produce a quasi-periodic waveform
that we recognize as speech. That is, the period of
the waveform varies with time depending on the sound being
produced. This phenomenon is most obvious in the production
of voiced or vibrated sounds. Figure 2, a general
discrete-time model of the human speech process, illustrates
this point more clearly. Here we have represented the vocal
tract model as a time-varying digital filter.

Note that the pulse train has an input labeled pitch
period. This input determines when the pulses will be
emitted from the pulse generator and at what periodicity.
This is only necessary for voiced speech.

Discrete-time Model for Speech Production [Ref. 4]
Figure 2.

The unvoiced speech is a continuous stream of random
numbers commonly referred to as white noise. The flow of
random numbers may produce a seemingly quasi-periodic sound;
however, since they are usually of such short duration, we
consider the sound to be continuous and constant, and not
periodic.

Each speech waveform has a specific amount of energy.
The energy contained within each utterance of a set duration
will be referred to as gain (G). This is what gives speech
its body or quality. It also aids reproduction by
indicating the intensity or inflection of the voice signal.

Once the voiced or unvoiced decision is made and an 
energy or gain is assigned, the scaled excitation function 
drives the vocal tract model. In a phone interview,
James Kaiser of Bell Laboratories mentioned that
current thinking in the area of speech reproduction has
refocused its attention on this portion of the model and
that there is a movement to more clearly describe the
physics behind the different physical contributors of
speech.

This vocal tract model is driven by the excitation and 
energy function and controlled by time varying vocal tract 
parameters. These vocal tract parameters adjust the vocal 
tract model to yield the desired output waveform. By
replacing the vocal tract model with an equivalent
time-varying digital filter that models the vocal tract
model's response, we are able to step right into the next
phase of synthetic speech reproduction.

C. DIGITAL FILTER REPRESENTATION 

Although speech is modeled most efficiently by poles and 
zeros, it may also be modeled accurately by an auto- 
regressive (all pole) filter if the order of the filter is 
large enough. For example, a tenth order auto-regressive 
filter will accurately model most audible sounds. 

Therefore, the transfer function H(z) of the digital
filter in Figure 2 is shown as Eq. 2-1.

$$H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}} \tag{2-1}$$

where p is the order of the filter, G is the gain, and $a_k$ is
the filter coefficient.
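
In the time domain, Eq. 2-1 corresponds to the recursion $s_n = G u_n + \sum_{k=1}^{p} a_k s_{n-k}$. A minimal sketch of that recursion follows, assuming zero initial conditions; the function name `all_pole_filter` and the example coefficient are hypothetical.

```python
def all_pole_filter(excitation, a, gain=1.0):
    """Apply H(z) = G / (1 - sum_k a[k-1] * z^-k) one sample at a time.

    `a` holds a_1..a_p; samples before the start are assumed to be zero.
    """
    output = []
    for n, u_n in enumerate(excitation):
        s_n = gain * u_n
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                s_n += a_k * output[n - k]  # feedback from past outputs
        output.append(s_n)
    return output

# A first order example: the impulse response decays geometrically.
response = all_pole_filter([1.0, 0.0, 0.0, 0.0], a=[0.5], gain=2.0)
```

The same loop handles a tenth order filter by passing ten coefficients in `a`.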

G and $a_k$ are the time-varying vocal tract parameters for
this filter. For a given segment of time (e.g., 10
milliseconds) the vocal tract parameters are constant. However,
when these segments are strung together in rapid succession to
produce a one second interval of speech, the parameters will
change 100 times. This is why they are referred to as time
varying; they vary over a short period of time.

The type of digital filter used in Figure 2 is
arbitrary. It is the concept behind the diagram that
counts. For the purposes of this research, the properties
and attributes of a time-varying lattice filter are best
because they lend themselves well to linear predictive
coding implementation.

III. LINEAR PREDICTION THEORY



A. WHY LINEAR PREDICTION? 

Although spectral analysis is a well-known technique for
studying signals, its application to speech signals suffers
from a number of serious limitations arising from the
nonstationary as well as the quasiperiodic properties of the
speech wave. By modeling the speech wave itself, rather
than its spectrum, we avoid the problems inherent in
frequency-domain methods.

For instance, traditional Fourier analysis methods 
require a relatively long speech segment to provide adequate 
spectral resolution. As a result, rapidly changing speech 
events cannot be accurately followed [Ref. 5:pp. 276-294].

Linear predictive coding is applicable to a wide range 
of research problems including speech production and 
perception. One of the main objectives in any speech 
processing technique is the synthesis of speech which is 
indistinguishable from normal human speech. 

Atal noted that much can be learned about the
information-carrying structure of speech by selectively
altering the properties of the speech signal. He also
stated that LPC techniques can serve as a tool for modifying
the acoustic properties of the speech signal [Ref. 5:p. 276].
These are exactly the intentions of this thesis: to modify
the speech signal by investigating the properties of the
information-carrying structure.

The remainder of this chapter is a summary of linear
prediction theory. The major portion of this section is
extracted from Makhoul's tutorial review on linear
prediction [Ref. 6:pp. 124-143], and will be based on an
intuitive approach, with emphasis on the clarity of ideas
rather than mathematical rigor.

B. LPC THEORY 

In applying time series analysis, each continuous signal
s(t) is sampled to obtain a discrete-time signal s(nT), also
known as a time series, where n is an integer variable and T
is the sampling interval. The sampling frequency is then
$f_s = 1/T$. Note that s(nT) will be represented as $s_n$ in this
discussion.
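
As a small illustration of the sampling step, the sketch below evaluates a continuous function at the instants nT. The helper name `sample_signal`, the 8000 Hz sampling frequency and the 1000 Hz sine wave are assumptions chosen only for the example.

```python
import math

def sample_signal(func, sample_rate, duration):
    """Return the time series s_n = s(nT), with T = 1 / sample_rate."""
    period = 1.0 / sample_rate                 # sampling interval T
    count = int(round(duration * sample_rate))  # number of samples taken
    return [func(n * period) for n in range(count)]

# One millisecond of a 1000 Hz sine wave sampled at an assumed 8000 Hz.
series = sample_signal(lambda t: math.sin(2.0 * math.pi * 1000.0 * t),
                       sample_rate=8000.0, duration=0.001)
```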

The signal $s_n$ is considered to be the output of some
system with some unknown input $u_n$ such that the following
relation holds:

$$s_n = -\sum_{k=1}^{p} a_k s_{n-k} + G \sum_{l=0}^{q} b_l u_{n-l}, \quad b_0 = 1 \tag{3-1}$$

where $a_k$, $b_l$, and the gain G are the parameters of the
hypothesized system. This equation says that the 'output'
$s_n$ is a linear combination of past outputs and present and
past inputs. That is, the signal $s_n$ is predictable from
linear combinations of past outputs and inputs. Hence the
name linear prediction.

C. PARAMETER ESTIMATION 

In the all-pole model, we assume that the signal $s_n$ is
given as a linear combination of its past values and some
current input $u_n$:

$$s_n = -\sum_{k=1}^{p} a_k s_{n-k} + G u_n \tag{3-2}$$

which yields the following frequency domain transfer
function

$$H(z) = \frac{G}{1 + \sum_{k=1}^{p} a_k z^{-k}} \tag{3-3}$$

Given a particular signal $s_n$, the problem is to determine
the predictor coefficients $a_k$ and the gain G in some
manner.

1. Method of Least Squares

Here we assume that the input $u_n$ is totally
unknown, which is the case in speech analysis. Therefore,
the signal $s_n$ can at best be approximately predicted from a
linearly weighted summation of past samples. Let the
approximation of $s_n$ be $\tilde{s}_n$, where

$$\tilde{s}_n = -\sum_{k=1}^{p} a_k s_{n-k} \tag{3-4}$$

Then the error between the actual value $s_n$ and the predicted
value $\tilde{s}_n$ is given by

$$e_n = s_n - \tilde{s}_n = s_n + \sum_{k=1}^{p} a_k s_{n-k} \tag{3-5}$$

The quantity $e_n$ is also known as the residual. In the
method of least squares the parameters $a_k$ are obtained as
a result of the minimization of the expected value or mean
of the error squared term, $E_p$, with respect to
each of the parameters. $E_p$ is the minimum mean square
prediction error, averaged over all n, and is represented by

$$E_p = \sum_{\text{all } n} \left[ s_n + \sum_{k=1}^{p} a_k s_{n-k} \right]^2 \tag{3-6}$$

For any definition of the signal $s_n$, a set of
equations with a set of unknowns can be solved for the
predictor coefficients which minimize $E_p$.
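
To make the residual concrete, the sketch below evaluates the prediction error of Equation 3-5 and the squared-error sum of Equation 3-6 for a given set of coefficients. It assumes the sign convention of Equation 3-4 (the prediction is the negative weighted sum) and treats samples before the start of the record as zero; the function names are hypothetical.

```python
def prediction_residual(signal, a):
    """e_n = s_n + sum_k a[k-1] * s_{n-k}; samples before n = 0 count as zero."""
    residual = []
    for n, s_n in enumerate(signal):
        e_n = s_n
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                e_n += a_k * signal[n - k]
        residual.append(e_n)
    return residual

def squared_error(residual):
    """Sum of squared residual samples (the quantity being minimized)."""
    return sum(e * e for e in residual)

# A decaying sequence that one coefficient predicts exactly after n = 0.
signal = [1.0, 0.5, 0.25, 0.125]
residual = prediction_residual(signal, a=[-0.5])
```

With the coefficient chosen well, only the first sample of the residual is nonzero, which is the behavior the least squares fit is trying to approach.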

There are two distinct methods for the estimation of 
these parameters, namely the autocorrelation method and the 
covariance method. Both methods are clearly described by 
Makhoul [Ref. 6:pp. 126-127]. Since the autocorrelation
method is the preferred method, only that method will be
summarized here.

a. Autocorrelation Method

Here we assume that the error $E_p$ is minimized
over an infinite duration. Since

$$R(i) = \sum_{n=-\infty}^{\infty} s_n s_{n+i} \tag{3-7}$$

is the autocorrelation function of the signal $s_n$,
Equation 3-6 reduces to

$$E_p = R(0) + \sum_{k=1}^{p} a_k R(k) \tag{3-8}$$

where R(0) is the total energy of the input signal and the
values R(k) are the elements of the autocorrelation matrix of
the input signal (see Figure 3).

Autocorrelation Matrix
Figure 3.

The autocorrelation matrix is a symmetric Toeplitz matrix
(a Toeplitz matrix is one in which all the elements along
each diagonal are equal). Since the signal $s_n$ is known over
only a finite interval, one popular method to control the
size of the Toeplitz matrix is to multiply the signal $s_n$ by
a window function $w_n$. This yields a slightly different
signal $s'_n$, which is zero outside the finite interval.
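
The windowing and autocorrelation steps can be sketched as follows. The Hamming window is an assumption made for the example (the text does not name a specific window), and the function names and the five-sample frame are hypothetical.

```python
import math

def hamming_window(length):
    """Hamming window w_n; the choice of window type is an assumption."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def autocorrelation(frame, max_lag):
    """R(i) = sum_n s_n * s_{n+i}, with the frame taken as zero outside its span."""
    count = len(frame)
    return [sum(frame[n] * frame[n + i] for n in range(count - i))
            for i in range(max_lag + 1)]

def autocorrelation_matrix(r, order):
    """Symmetric Toeplitz matrix whose (i, j) entry is R(|i - j|)."""
    return [[r[abs(i - j)] for j in range(order)] for i in range(order)]

frame = [1.0, 2.0, 3.0, 2.0, 1.0]
windowed = [s * w for s, w in zip(frame, hamming_window(len(frame)))]
r = autocorrelation(windowed, max_lag=2)
matrix = autocorrelation_matrix(r, order=2)
```

Because the windowed frame is zero outside its span, the infinite sum of Equation 3-7 reduces to the finite sum used in `autocorrelation`.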

In any case, the autocorrelation matrix is the 
means for solving several of the linear predictive 
coefficients needed to analyze and synthesize speech. The 
following chapter discusses, in greater depth, what those 
coefficients are and how they are obtained. 

IV. LINEAR PREDICTION OF SPEECH



A. INTRODUCTION 

As mentioned earlier, there are several ingredients or 
time-varying parameters that are needed to generate speech. 
When using linear predictive coding techniques, three 
ingredients are essential: gain or energy, pitch period, and 
the filter reflection coefficients or spectral envelope
parameters.

Figure 4 illustrates the fact that, depending on the
specified frame length, these ingredients must change every
10 to 20 ms. On a frame-by-frame basis the incoming signal
is processed to obtain the gain, the pitch period and the
reflection coefficients $k_1, k_2, \ldots, k_N$.

The pitch period and the gain parameters are used to 
construct an excitation function for production of either 
voiced or unvoiced speech. This driving or excitation
function is input to a filter which is configured by the
spectral envelope parameters determined from the analysis.
The output is one frame of synthetic speech, and by
stringing several frames of speech together, audible sounds
are produced [Ref. 7:pp. 337-345].

Analysis of the speech signal is done by calculating the 
LPC model parameters for each 10 ms time frame. This 
chapter will discuss these essential parameters. 

LPC Model of the Human Voice [Ref. 7:p. 338]
Figure 4.



B. LPC ENCODING PARAMETERS 



1. Voiced/Unvoiced Decision Making

Some sounds require the vibrations induced by the 
vocal cords, while others do not. Voiced sounds represent 
those that require an excitation from the vocal cords or 
lips. Unvoiced sounds are generated by a steady flow of air 
as in the case of 's' or 'f'. A decision must be made in
order to properly excite the digital filter to produce the
desired sounds.

According to Atal [Ref. 5:p. 280] the voiced/unvoiced
decision is based on the ratio of the mean-squared
value of the speech samples to the mean-squared value of the
prediction error samples. This ratio is considerably
smaller for unvoiced speech sounds than for voiced speech
sounds. Typically, the decision threshold for this ratio is
a factor of 10.

Voiced Decision: $E[s_n^2] > 10\,E[e_n^2]$

Unvoiced Decision: $E[s_n^2] < 10\,E[e_n^2]$

This decision will determine whether to excite the
digital filter with an impulse function or white noise, each
having a particular gain or energy.
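
The decision rule reduces to a comparison of two mean-squared values. A minimal sketch follows, assuming the speech samples and prediction error samples of one frame are already available; the function names and the sample values are hypothetical.

```python
def mean_square(samples):
    """Mean-squared value of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)

def is_voiced(speech, prediction_error, threshold=10.0):
    """True when E[s^2] exceeds threshold * E[e^2] for the frame."""
    return mean_square(speech) > threshold * mean_square(prediction_error)

# Hypothetical frames: a well predicted frame and a poorly predicted one.
voiced_frame = is_voiced([4.0, -4.0, 4.0, -4.0], [1.0, -1.0, 1.0, -1.0])
unvoiced_frame = is_voiced([2.0, -2.0, 2.0, -2.0], [1.0, -1.0, 1.0, -1.0])
```

The threshold of 10 is the typical factor quoted above; it is exposed as a parameter because a practical system may need to tune it.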

2. Gain Computation

In explaining the least squares method of linear
prediction we assumed that the input was unknown.
Equation 3-5 can be rewritten as

$$s_n = -\sum_{k=1}^{p} a_k s_{n-k} + e_n \tag{4-1}$$

Comparing Equations 3-2 and 4-1 we see that the only input
signal $u_n$ that will result in the signal $s_n$ as output is
that where $G u_n = e_n$. That is, the input signal is
proportional to the error signal. For any other input the
output will be different from $s_n$. We therefore require only
that the energy of the output signal equal the energy of the
original signal $s_n$.

Since the filter H(z) is fixed, it is clear from the
above that the total energy in the input signal $G u_n$ must
equal the total energy in the error signal, which is given
by $E_p$. Again, Makhoul [Ref. 6:p. 128] is the primary source
for this information and he provides additional mathematical
background in determining the resultant gain equation

$$G^2 = E_p = R(0) + \sum_{k=1}^{p} a_k R(k) \tag{4-2}$$

where $G^2$ is the total energy in the input and the values
R(k) are, again, the elements of the autocorrelation matrix.

The classification of a sound as voiced or unvoiced
determines the input to the filter H(z). However, whether the
input $G u_n$ is white noise or a series of impulses, the gain
is calculated from the same equation.
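
Equation 4-2 translates directly into code once the autocorrelation values and predictor coefficients are known. The sketch below assumes the sign convention of Equation 3-2 for the coefficients; the function name `lpc_gain` and the numeric values are hypothetical.

```python
import math

def lpc_gain(r, a):
    """G = sqrt(R(0) + sum_k a[k-1] * R(k)), following Equation 4-2.

    `r` holds R(0)..R(p) and `a` holds a_1..a_p.
    """
    g_squared = r[0] + sum(a_k * r[k] for k, a_k in enumerate(a, start=1))
    return math.sqrt(max(g_squared, 0.0))  # guard against rounding below zero

# Hypothetical first order case: R(0) = 4, R(1) = 2, a_1 = -0.5.
gain = lpc_gain([4.0, 2.0], [-0.5])
```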

3. Pitch Period



The period of time that elapses between each 
excitation pulse is referred to as the pitch period. Atal 
[Ref. 5:p. 279] describes two different methods for 
determining pitch period. His second method is summarized 
here since it is based on the linear predictive 
representation of the speech wave. 

In this method, except for a sample at the beginning
of each pitch period, every sample of the voiced speech
waveform can be predicted from the past values. The method
of determining pitch period is relatively simple.

Once the prediction error of the speech signal is
determined through linear predictive processing, the largest
or peak values are noted (Figure 5). These points
determine the times at which excitation pulses should be
initiated from the excitation source. This simple
peak-picking procedure was found to be effective in determining
pitch period as developed in Reference 7.

Pitch Period Estimation Using Peak Picking
Figure 5.
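
A peak-picking procedure of the kind described can be sketched as below. The specific rules here (a threshold at half the largest magnitude and a minimum spacing of 20 samples between pulses) are illustrative assumptions, not the procedure of Reference 7, and the function names are hypothetical.

```python
def pick_pitch_pulses(residual, threshold_ratio=0.5, min_spacing=20):
    """Return sample indices where |residual| is a local peak above a threshold."""
    peak = max(abs(x) for x in residual)
    if peak == 0.0:
        return []
    threshold = threshold_ratio * peak
    pulses = []
    for n in range(1, len(residual) - 1):
        magnitude = abs(residual[n])
        if magnitude < threshold:
            continue  # too small to mark the start of a pitch period
        if magnitude < abs(residual[n - 1]) or magnitude < abs(residual[n + 1]):
            continue  # not a local peak
        if pulses and n - pulses[-1] < min_spacing:
            continue  # too close to the previously accepted pulse
        pulses.append(n)
    return pulses

def average_pitch_period(pulses):
    """Mean spacing, in samples, between consecutive pulses (None if undefined)."""
    gaps = [b - a for a, b in zip(pulses, pulses[1:])]
    return sum(gaps) / len(gaps) if gaps else None

# Synthetic residual: small error everywhere, spikes every 80 samples.
example = [0.05] * 300
for index in (10, 90, 170, 250):
    example[index] = 1.0
pulse_times = pick_pitch_pulses(example)
```

Dividing the average spacing by the sampling frequency converts the pitch period from samples to seconds.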

4. Reflection Coefficients



Earlier it was mentioned that the reflection 
coefficients determined using LPC are directly related to 
the polynomial coefficients of an all pole model. This 
section will show the relationship between them and 
illustrate how the reflection coefficients are determined. 

Recall that we are looking for an estimated output 
which is the weighted sum of past system outputs (see 
Eqns. 3-4 and 3-5). The autoregressive (AR) model in
Figure 6 illustrates this process. 




Autoregressive Model 
Figure 6. 



The goal of LPC is to adjust the $a_k$'s to minimize
$E_p$. Achieving it involves solution of a linear system of
equations, using Levinson's algorithm, and leads to the
lattice structure AR model we are most interested in (see
Figure 7). The mathematical development for this may be
found in Parker [Ref. 9:pp. 110-112].




Lattice Structure Analysis Model 
Figure 7. 

Lattice structuring requires the determination of
reflection coefficients, hereafter referred to as K. The
K's of an Nth order lattice filter transfer function are
related to the polynomial coefficients of an Nth order AR
filter transfer function through the following matrix
equation:

$$a^{(N+1)} = \begin{bmatrix} a^{(N)} \\ 0 \end{bmatrix} + K^{(N+1)} \begin{bmatrix} -\tilde{a}^{(N)} \\ 1 \end{bmatrix} \tag{4-3}$$

where

$$K^{(N+1)} = \frac{R_{yy}(N+1) - \tilde{r}_{yy}^{T(N)} a^{(N)}}{R_{yy}(0) - r_{yy}^{T(N)} a^{(N)}} \tag{4-4}$$

and a tilde denotes a vector with its elements in reverse
order. The matrix $r_{yy}$ is the last column of the $R_{yy}$
autocorrelation matrix mentioned earlier. The notation has
been slightly altered from Parker's presentation [Ref. 9:p.
112] to be consistent with the preceding chapters of this
development.
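
Levinson's algorithm produces the predictor coefficients and the reflection coefficients together, one order at a time. The following is a minimal sketch of the standard Levinson-Durbin recursion, written with the sign convention of Equations 3-4 through 3-8 (error $e_n = s_n + \sum a_k s_{n-k}$); it is not a transcription of Parker's notation, and the function name and return layout are hypothetical.

```python
def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations recursively.

    `r` holds R(0)..R(order). Returns (a, reflection, error) where `a` holds
    a_1..a_order, `reflection` holds one K per stage, and `error` is the
    final prediction error energy.
    """
    a = []
    error = r[0]
    reflection = []
    for i in range(1, order + 1):
        if error <= 0.0:
            raise ValueError("prediction error energy is not positive")
        acc = r[i] + sum(a[j - 1] * r[i - j] for j in range(1, i))
        k = -acc / error
        # Order update: a_j <- a_j + K * a_{i-j}, then append a_i = K.
        a = [a[j - 1] + k * a[i - j - 1] for j in range(1, i)] + [k]
        error *= (1.0 - k * k)
        reflection.append(k)
    return a, reflection, error

# Autocorrelation of a first order process: R(k) proportional to 0.5 ** k.
coeffs, ks, final_error = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

For this input the second reflection coefficient comes out as zero, showing that a first order model already captures the sequence; the magnitude of each K stays below one whenever the autocorrelation values come from a real signal.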

Equations 4-3 and 4-4 have been included in this
presentation to show how the polynomial coefficients ($a_k$'s)
are related to the reflection coefficients (K's).
However, there is an easier and more direct method for
determining K's. A brief development is presented here.

Working in the Z-domain, we know that the transfer 
function of the AR model is 




(4-5) 



and 

where $\tilde{A}(z)$ is $A(z)$ in reverse order.

Combining and reforming in matrix form yields

(4-7)

or more simply

(4-8)

and

(4-9)

Writing Equation 3-5 in the Z-domain yields

$$E^{(N)}(z) = A^{(N)}(z)\, S(z) \tag{4-10}$$



Combining 4-10 with 4-8 and 4-9 and returning to the time
domain yields the following error equations.

(4-11)

(4-12)

where $e^{(N+1)}(k)$ is the forward difference error, and
$\tilde{e}^{(N)}(k)$ is the backward difference error.
Equations 4-11 and 4-12 correspond to the lattice
implementation in Figure 7. They have been used to determine
the K's of a 12th order model in the sine wave and speech
experiments which follow.
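
A lattice analysis filter propagates a forward and a backward error through one stage per reflection coefficient. The sketch below uses the common textbook form of the stage update with zero initial state and the sign convention of Equation 3-5; the signs and indexing of the thesis's own Equations 4-11 and 4-12 may differ, so this should be read as an assumption-laden illustration. The function name is hypothetical.

```python
def lattice_analysis(signal, reflection):
    """Return the final-stage forward and backward error sequences.

    Stage update (assumed form):
        f_i(n) = f_{i-1}(n)   + K_i * b_{i-1}(n-1)
        b_i(n) = b_{i-1}(n-1) + K_i * f_{i-1}(n)
    with f_0(n) = b_0(n) = s_n and zero state before n = 0.
    """
    forward = list(signal)
    backward = list(signal)
    for k in reflection:
        delayed = [0.0] + backward[:-1]  # b_{i-1}(n-1)
        new_forward = [f + k * b for f, b in zip(forward, delayed)]
        new_backward = [b + k * f for f, b in zip(forward, delayed)]
        forward, backward = new_forward, new_backward
    return forward, backward

# A single stage removes the predictable part of a geometric sequence.
fwd, bwd = lattice_analysis([1.0, 0.5, 0.25, 0.125], reflection=[-0.5])
```

The final forward error equals the residual of the equivalent direct-form predictor, which is why the lattice and the polynomial coefficients describe the same filter.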

The order of the filter is simply determined by
assigning N. For speech, anywhere from a 6th to a 12th
order model has been found to be sufficient.

The reflection coefficients are determined every 10 to
20 milliseconds and, when lined up side by side, appear to
present a spectral envelope (Figure 8).

Figure 8.

Determining the reflection coefficients, in any case,
is a straightforward calculation, which is an attractive
feature of LPC. It is the pattern these K's may produce in
our experiment that we will be most interested in.

5. Spectral Analysis

A convenient way to portray the frequency content of 
speech is through the determination of formant frequencies. 
Formant frequencies are the most prominent frequencies 
present in a speech waveform. 

Formant frequencies are not required to produce LPC 
synthesized speech. In other words, given the voiced 
decision, gain, pitch period, and the reflection 
coefficients, one has enough information to reconstruct the 
speech wave form. However, the determination of the formant 
frequencies aids us in depicting a frequency transposition. 

The complex roots of the denominator polynomial
are the complex formants (bandwidths and frequencies) used
to approximate the speech signal. The coefficients, $a_k$, of
the denominator polynomial are obtained from time-domain
calculations on samples of a short segment of the speech
waveform; namely $\{s_n\} = \{s_1, s_2, \ldots, s_N\}$, where $N \gg p$. Here
N is the number of samples, and p is the order of the
polynomial [Ref. 11:pp. 364-366].

Under the assumption that the waveform
samples, $s_n$, are samples of a random Gaussian process, the
entire speech sample is broken up into groups of an equal
number of samples which we will refer to as segments (Figure 9).
Each segment is processed using the Fast Fourier Transform
(FFT) and then low pass filtered if desired.




Flow Chart for Obtaining the Spectral Content 
of One Complete Utterance 

Figure 9. 

The output of each segment contains the spectral
content of that segment. The segments are sequenced together
to yield a time-varying frequency content profile of the
entire utterance, with each segment containing its particular
frequency content. The formant frequencies are the most
prevalent, or peak, frequencies found in the speech
waveform.

C. SPEECH SYNTHESIS



A speech signal is synthesized by using the same
parameters determined with LPC analysis. A block diagram of
a speech synthesizer was shown in Figure 4. The control
parameters supplied to the synthesizer are the pitch period,
a binary voiced or unvoiced parameter, the rms value of the
speech samples or gain, and the predictor or reflection
coefficients.

The pulse generator produces a pulse of unit amplitude 
at the beginning of each pitch period. The white noise 
generator produces uncorrelated uniformly distributed random 
samples with standard deviation equal to 1 at each sampling 
instant. The selection between the pulse generator and the 
white noise generator is made by the voiced-unvoiced switch. 
The synthesizer control parameters are reset to their new 
values at the beginning of every pitch period for voiced 
speech and once every 10 msec for unvoiced speech. 

The amplitude of the excitation signal is adjusted by
the amplifier G. The linearly predicted value $\tilde{s}_n$ of the
speech signal is combined with the excitation signal $u_n$ to
form the n-th sample of the synthesized speech signal. The
signal is finally low-pass filtered to provide the
continuous speech wave s(t). Atal [Ref. 5:p. 280] provides
the mathematical development needed to synthesize speech
from these parameters. A mathematical discussion will not be
pursued further here.

V. DIGITAL FREQUENCY TRANSPOSITION



A. INTRODUCTION 

The object of this research was to determine an 
algorithm that will digitally transpose speech using linear 
predictive coding. In this chapter, Hall's research [Ref.
3] will be briefly discussed and summarized. A new theory
will then be postulated and a simple experiment using pure
sine waves will be presented to test the credibility of the
theory. Keep in mind that the real test will be the actual
processing of speech; this section simply sets the scene for
further study.

B. POLE SHIFTING IN THE Z-PLANE

Only the highlights and summary of Hall's thesis will be 
presented here. His goal was to change the pole locations 
before reconstruction (of the sampled speech signal) to 
produce the output voice with different pitch and formant
frequencies while retaining a natural sound and the same
information [Ref. 3:p. 47].

The autoregressive vocal tract transfer function used in 
his research is represented by Equation 5-1. 

$$H(z) = \frac{1}{1 - 2e^{-2\pi (BW) T_s} \cos(2\pi F T_s)\, z^{-1} + e^{-4\pi (BW) T_s} z^{-2}} \tag{5-1}$$

where F is the center frequency of the formant, BW is
the bandwidth of the formant, and $T_s$ is the sampling
interval. The pole locations associated with this transfer
function are:

$$z = A e^{\pm j\theta}$$

Converting Equation 5-1 into polar form produces Equation
5-2.

$$H(z) = \frac{1}{1 - 2A\cos\theta\, z^{-1} + A^2 z^{-2}} \tag{5-2}$$

Through several mathematical manipulations and solving
for A and $\theta$, the following relationships for F and BW are
determined:

$$F = \frac{\theta}{2\pi T_s} \tag{5-3}$$

$$BW = \frac{-\ln A}{2\pi T_s} \tag{5-4}$$

where $A = e^{-2\pi (BW) T_s}$ and $\theta = 2\pi F T_s$.
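
The relationships between a pole's polar coordinates and the formant it represents can be checked numerically. The sketch below follows the definitions of A and the pole angle given above; the function names and the example values (a 500 Hz formant with a 50 Hz bandwidth at an 8000 Hz sampling frequency) are assumptions chosen only for illustration.

```python
import math

def pole_to_formant(radius, angle, sample_period):
    """Return (F, BW) in Hz for a pole at radius * exp(+/- j * angle)."""
    freq = angle / (2.0 * math.pi * sample_period)
    bandwidth = -math.log(radius) / (2.0 * math.pi * sample_period)
    return freq, bandwidth

def formant_to_pole(freq, bandwidth, sample_period):
    """Return (A, theta) for a formant of frequency F and bandwidth BW."""
    radius = math.exp(-2.0 * math.pi * bandwidth * sample_period)
    angle = 2.0 * math.pi * freq * sample_period
    return radius, angle

# Round trip for a hypothetical formant.
period = 1.0 / 8000.0
radius, angle = formant_to_pole(500.0, 50.0, period)
freq, bandwidth = pole_to_formant(radius, angle, period)
```

Any positive bandwidth gives a radius below one, so a formant described this way always corresponds to a pole inside the unit circle.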

Assuming that a linear relationship exists between F
(the original frequency) and F' (the modified frequency),
several general expressions are stated to illustrate the
underlying modification to the pole locations. Note that
the following equations are all linear relationships.

$$F' = \beta F \tag{5-5}$$

$$BW' = \alpha\, BW \tag{5-6}$$

$$\theta' = \beta\, \theta \tag{5-7}$$

(5-8)

The most important consideration for producing these 



relationships is guaranteeing that no unstable poles will be 
created by shifting them outside the unit circle. For more 
of the specifics on Hall's development see Reference 3, 
pages 49 and 50. 
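
The mapping between a pole and its formant parameters (Equations 5-3 and 5-4) can be exercised numerically. The Python sketch below is illustrative only and is not Hall's implementation: the function names, the scale factors `freq_scale` and `bw_scale`, and the stability check are assumptions introduced here. It converts a pole radius A and angle θ to F and BW, scales both linearly, and converts back, rejecting any result that would leave the unit circle.

```python
import math


def pole_to_formant(radius, angle, ts):
    """Map pole radius A and angle theta to (F, BW) per Equations 5-3 and 5-4."""
    freq = angle / (2.0 * math.pi * ts)
    bandwidth = -math.log(radius) / (2.0 * math.pi * ts)
    return freq, bandwidth


def formant_to_pole(freq, bandwidth, ts):
    """Inverse mapping: A = exp(-2*pi*BW*Ts), theta = 2*pi*F*Ts."""
    radius = math.exp(-2.0 * math.pi * bandwidth * ts)
    angle = 2.0 * math.pi * freq * ts
    return radius, angle


def shift_pole(radius, angle, ts, freq_scale, bw_scale):
    """Scale F and BW linearly (hypothetical factors) and return the new pole."""
    freq, bandwidth = pole_to_formant(radius, angle, ts)
    new_radius, new_angle = formant_to_pole(freq_scale * freq,
                                            bw_scale * bandwidth, ts)
    # A pole on or outside the unit circle would make the filter unstable.
    if not 0.0 < new_radius < 1.0:
        raise ValueError("shifted pole is not inside the unit circle")
    # An angle beyond pi would place the formant above half the sampling rate.
    if not 0.0 <= new_angle <= math.pi:
        raise ValueError("shifted formant frequency exceeds half the sampling rate")
    return new_radius, new_angle
```

With a positive bandwidth scale the radius stays below one, so stability is preserved as long as the original pole was stable; the frequency scale only moves the pole around the circle.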

Two experiments are illustrated in Hall's thesis. They 
are: 

1. Pitch was reduced by a factor of .58 and the 
formant frequencies reduced by .88 for voiced 
speech. 

2. The same modification was done for a segment of 
unvoiced speech. 

Hall concluded that upon completion of the process most 
listeners agreed that, although the input speech was female, 
the modified output speech sounded typically male. It was 
also noted that although the audio output was somewhat 
lacking in quality, it was intelligible [Ref. 3:p. 73]. The 
tapes which recorded that audio output are no longer 
available for subjective evaluation. 

Linear predictive coding is a means to an end for Hall. 
He modifies the variables mentioned (F, BW, θ, A), and 
processes the speech with LPC computer programs. This 
conversion between an autoregressive vocal tract model and an 
LPC model (implemented most easily by a lattice filter 
configuration) is possible through Equations (4-3) and 
(4-4). 

The mathematics are simple. What is most important here 
is that the two different representations of speech, the AR 
model and the LPC model, are closely associated with one 
another. To calculate one, in a sense, is to calculate the 
other. 
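
The close association between the two representations can be sketched with the standard step-up and step-down recursions. This is a minimal Python illustration, not a transcription of Equations 4-3 and 4-4: the function names and the sign convention A(z) = 1 + a1 z^-1 + ... + ap z^-p are assumptions and may differ from the convention used in this thesis.

```python
def reflection_to_ar(ks):
    """Step-up recursion: reflection coefficients -> AR polynomial [1, a1, ..., ap]."""
    a = [1.0]
    for k in ks:
        m = len(a)                      # order being built
        prev = a + [0.0]
        a = [prev[i] + k * prev[m - i] for i in range(m + 1)]
    return a


def ar_to_reflection(a):
    """Step-down recursion: AR polynomial [a0, a1, ..., ap] -> reflection coefficients."""
    a = [c / a[0] for c in a]           # normalise so the leading term is 1
    ks = []
    for m in range(len(a) - 1, 0, -1):
        k = a[m]                        # highest-order term is the reflection coefficient
        if abs(k) >= 1.0:
            raise ValueError("polynomial is not minimum phase (unstable model)")
        ks.append(k)
        a = [(a[i] - k * a[m - i]) / (1.0 - k * k) for i in range(m)]
    ks.reverse()
    return ks
```

For the two-pole resonator of Equation 5-2 this convention gives k2 = A^2 and k1 = -2A cos θ / (1 + A^2), so both the pole radius and the pole angle appear in the reflection coefficients, but not as a simple proportionality.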

C. A NEW PROPOSITION 

1. Statement of Theory 

As mentioned earlier, LPC techniques can serve as a 
tool for modifying the acoustic properties of the speech 
signal. This thesis postulates that a linear relationship 
exists between the reflection coefficients, which determine 
the spectral envelope of the speech wave form, and the 
frequency content of that wave form. If this relationship 
exists and the linear relationship is determined, then by 
selectively modifying the reflection coefficients, the 
frequency content will also be modified. 



Is there a linear relationship between the reflection 
coefficients (K's) and frequency content? The first step in 
our proof is to analyze the most simplified case. Since 
speech is often represented as a combination of many 
different frequencies, the simplest case would be to analyze 
a fixed frequency sine wave. If the results turn out to be 
negative, then exploring the more complex case (speech) 
would probably be futile. 

2. Sine Wave Experiment 

At any given frequency a pure sine wave may be 
considered a continuous energy and amplitude signal which 
will generate an audible pitch when it is within the 200 Hz 
to 15 kHz audible range. When dealing with normal speech 
wave forms, the audible pitch range is somewhere between 
200 Hz and 5 kHz. 

A computer program was written in Fortran [App. A] 
for use on the IBM 3033 to produce a sine wave for further 
analysis. The resultant sine wave could be sampled at any 
desired rate and the frequency of the wave could be 
incremented to satisfy the range requirements of 200 Hz to 
5 kHz. 

Once the sampling rate was determined and the sine 
wave frequency set, the reflection coefficients were 
calculated for a 10 ms time frame, stored in a holding file, 
and plotted to determine if a relationship exists between 
frequency and the nth order K's. 

To determine 12 reflection coefficients (K's) for 
each frequency, Equations 4-11 and 4-12 were used. 
Additional runs were also made to determine the effect of 
noise on the outcome. The results were promising. 
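
The experiment just described can be approximated in a few lines. The Python sketch below is not the Fortran program of Appendix A and does not reproduce Equations 4-11 and 4-12; it substitutes the autocorrelation method with the Levinson-Durbin recursion, and every function name, the default 10 kHz rate, and the frequency grid in the usage note are assumptions made for illustration.

```python
import math


def sine_frame(freq_hz, fs_hz=10000.0, duration_s=0.010):
    """One frame of a unit-amplitude sine wave (100 samples at the defaults)."""
    n = int(round(fs_hz * duration_s))
    return [math.sin(2.0 * math.pi * freq_hz * i / fs_hz) for i in range(n)]


def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite frame (no normalisation)."""
    return [sum(x[n] * x[n + lag] for n in range(len(x) - lag))
            for lag in range(max_lag + 1)]


def levinson_reflection(r, order):
    """Levinson-Durbin recursion returning reflection coefficients k1..k_order."""
    a = [1.0]
    err = r[0]
    ks = []
    for m in range(1, order + 1):
        if err <= 1e-12 * r[0]:
            break                       # prediction error exhausted; stop early
        acc = sum(a[i] * r[m - i] for i in range(m))
        k = -acc / err
        ks.append(k)
        prev = a + [0.0]
        a = [prev[i] + k * prev[m - i] for i in range(m + 1)]
        err *= (1.0 - k * k)
    return ks


def sweep(freqs_hz, order=12):
    """Reflection coefficients for each test frequency, keyed by frequency."""
    return {f: levinson_reflection(autocorrelation(sine_frame(f), order), order)
            for f in freqs_hz}
```

Calling `sweep(range(200, 5000, 200))` yields one set of K's per frequency, which can then be plotted against frequency. Under this sign convention the first coefficient is close to -cos(2πf/fs), so it varies smoothly with frequency but follows a cosine rather than a straight line over the full band.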

3. Sine Wave Experimental Results 

Appendixes B and C illustrate the apparent linear 
relationship that exists between frequency and the LPC nth 
order K's in a noiseless environment. Appendixes D and E 
illustrate that same relationship in a noisy environment 
(S:N = 10:1). 

It would appear that a linear relationship does 
exist between the frequency of a sine wave and its 
reflection coefficients. Noise, on the other hand, changes 
that linear relationship. Noise addition seems to affect K7 
through K12 much more than K1 through K6. 

Considering the mathematics involved in calculating 
K, these observations are reasonable. Since the later K's 
are affected most by small changes in the input signal, 
addition of noise will affect them more drastically than the 
earlier stages. 

Though these observations are promising, they are by 
no means conclusive. If no correlation between the K's and 
frequency existed, another scheme would have had to be 
considered. Nevertheless, speech is the more complicated 
signal that we consider in the next two sections. 

VI. SPEECH PROCESSING EXPERIMENT 



A. INTRODUCTION 

Now that the fundamentals of linear predictive coding 
have been presented and a theory of frequency transposition 
proposed, it is necessary to work directly with speech 
itself. To obtain the information we are seeking, the 
correlation between reflection coefficients and frequency 
content, speech samples must be examined. 

Documentation concerning the data acquisition system 
used in this research to obtain speech samples is provided 
as Appendix F. This chapter discusses the data itself, and 
the processing of it. 

B. VOICED/UNVOICED PHRASES 

Three phrases were chosen for their voiced and unvoiced 
characteristics as described in Chapters 2 and 4. They are: 

1) "READY" 

2) "SO WHAT" 

3) "SNEEZE" 

Each phrase was repeated at a different pitch and, to 
make things simple, the musical scale was picked to help 
harmonize a change in pitch with some type of reference. In 
other words, "READY" was first spoken in the middle-C range, 
and then in the D range, until it was finally spoken in the 
high-C range. 

This procedure yielded eight different pitches for each 
of the three phrases. One male speaker provided the data 
for all three phrases. Additionally the period remained 
constant for each pitch and their individual utterances. 
For a graphical representation of the selected speech 
utterances, refer to Appendices G, H, and I. 

Each phrase was chosen for content and can be classified 
as voiced, unvoiced, or a combination of both. "READY" is 
strictly a voiced word, whereas "SO WHAT" and "SNEEZE" are a 
combination of voiced and unvoiced segments. The S, WH, and 
T sounds in "SO WHAT" will be our unvoiced example, and 
"SNEEZE" will be the combined example as the data is 
analyzed. 



C. DATA PROCESSING 

This section discusses the techniques utilized to 
analyze the data and the observations made. 



1. Speech Data 

The raw speech data was edited and displayed using a 
generic display program. The data is 8-bit information with 
a maximum range of 256 equally spaced values. The 
resolution of each utterance varied with the pitch. The 
lower frequencies tended to have less gain or energy and 
therefore did not use all 256 range values available. A 
summary of the ranges is provided in Appendix J. 

The periods of each phrase were different. The 
differences between the same utterance at different pitches 
varied as much as 20 msec. A short summary of the average 
periods is given in Table 1. 



TABLE 1. 

UTTERANCE    PERIOD    NO. SEGMENTS    NO. DATA PTS./SEG 
"XXX"        (sec)     N               (10 msec SEG) 

"READY"      .30       30              100 
"SO WHAT"    .40       40              100 
"SNEEZE"     .38       38              100 


The sampling rate was 10 kHz for all of the 
utterances, so the number of data points in each 10 msec 
segment is 100. 

Once the starting point is determined for each 
utterance, the reflection coefficients are calculated for 10 
msec segments of speech [App. K]. Successive segments are 
analyzed to yield their respective reflection coefficients 
using Equations 4-11 and 4-12, as were the sine wave 
calculations. 
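
The segmentation step can be sketched as follows. This Python fragment is an illustration, not the program of Appendix K; the function name and default arguments are assumptions. It splits an utterance into consecutive 10 msec segments, each of which would then be analyzed for its own set of reflection coefficients.

```python
def split_into_segments(samples, fs_hz=10000, segment_ms=10):
    """Split a sampled utterance into consecutive fixed-length segments.

    At 10 kHz and 10 msec each segment holds 100 samples. Any trailing
    samples that do not fill a whole segment are dropped.
    """
    seg_len = fs_hz * segment_ms // 1000
    n_segments = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]
```

A 0.30 second utterance sampled at 10 kHz contains 3000 samples and therefore yields 30 segments of 100 points, in agreement with the first row of Table 1.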

Reflection coefficients K1 through K6 were plotted 
for each of the 24 utterances, and several of the resultant 
curves are included as Appendix L. 

2. Trend Analysis 

A graphical trend analysis of the plotted data 
was undertaken to detect any obvious patterns. The details 
of that analysis are included as Appendix M. In summary, 
those observations lead us to the conclusion that no 
significant trends were noted as a function of pitch. 

One graph was held stationary as a reference and 
the others were passed over it to see if there were any 
obvious matches. There is nothing more elaborate to 
report than that no correlation was noted between them. 
Even though at times there were 2 or 3 points which matched 
up, the other 28, 36, or 38 points did not. Also, there 
seemed to be no distinction between voiced and unvoiced 
portions of the speech wave. This process leads to the 
conclusion that the various speech segments are highly 
uncorrelated. 

3. Spectral Analysis of Reflection Coefficient Patterns 



It was noted during the trend analysis that the 
temporal patterns presented by the reflection coefficients 
seemed periodic. At first it was believed that this could 
possibly reflect the pseudo-periodic nature of speech or the 
excitation source. 

Spectral analysis was implemented using a Fortran 
subroutine to compute the FFT of each pattern. The program 
is included as Appendix N and several examples of the 
results are provided as Appendix O. 

In summary, all of the spectra turned out to be 
relatively flat. This indicates that there are no prominent 
frequencies within the reflection coefficient sequences. 
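
One way to make "relatively flat" concrete is to compare the largest spectral magnitude of a reflection coefficient sequence with the average magnitude. The Python sketch below is not the Fortran routine of Appendix N: it uses a direct discrete Fourier transform instead of an FFT for self-containment, and the function names and the peak-to-mean measure are assumptions chosen for illustration.

```python
import cmath
import math


def dft_magnitudes(seq):
    """Magnitudes of DFT bins 1..N/2 of a sequence after removing its mean."""
    n = len(seq)
    mean = sum(seq) / n
    centered = [v - mean for v in seq]
    mags = []
    for k in range(1, n // 2 + 1):
        acc = sum(centered[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc))
    return mags


def peak_to_mean(mags):
    """Ratio of the largest magnitude to the average magnitude.

    A value near 1 indicates a flat spectrum; a large value indicates
    a prominent frequency in the sequence.
    """
    mean = sum(mags) / len(mags)
    return max(mags) / mean if mean > 0 else 0.0
```

A sequence of K values taken once per 10 msec segment can be passed directly to `dft_magnitudes`; a strongly periodic pattern would give a ratio well above one, while a flat spectrum gives a ratio close to one.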

4. Analysis for Frequency Content 

Spectral analysis to determine the frequency content 
of each utterance, as described in Chapter 4, would have 
been useful had a pattern or linear relationship shown up in 
the observations mentioned. 

Since there are no patterns or correlations worth 
mentioning, exploring the specific frequency content of each 
utterance would not benefit us. The relative difference 
between each frequency, or Δf, is approximately 32 Hz. 

The range of the utterances was chosen to coincide 
with the musical scale from middle-C to high-C (a 256 Hz 
difference). Had a relationship been discovered, as 
proposed, then a more in-depth spectral analysis of the 
input speech would have been in order. 

D. SUMMARY OF EXPERIMENTAL RESULTS 

The linear relationship postulated in Chapter 5 
should have yielded more obvious results if relationships 
did exist between identical phrases spoken at different 
pitches. Three of the four categories mentioned above 
yielded negative or uncorrelated results. 

2. Voiced/Unvoiced Observations 

Though there may be other or more sophisticated 
techniques available to analyze this data, the methods 
mentioned above were sufficient to show that a voiced phrase 
was no more correlated than an unvoiced phrase. 

The consistently negative or uncorrelated results lead 
us to some conclusions about the actual relationship between 
frequency content and reflection coefficients. 

VII. CONCLUSIONS AND RECOMMENDATIONS 

A. CONCLUSIONS 

A new theory to transpose frequency was postulated and 
tested. Initial results, using sine waves, seemed positive 
and led to a further study using speech waveforms. The 
preceding experiment and subsequent analysis of speech 
showed no apparent correlation between pitch and reflection 
coefficient values. These results may be attributed to the 
following reasons. 

1. Complexity of Speech 

The speech wave form is a very complex combination 
of gain, excitation, and spectral content. To pick out one 
particular attribute and analyze it for a particular 
phenomenon, such as frequency content, may be unrealistic. 

Speech has historically been modeled as a 
combination of sine waves. However, slow progress in the 
field of speech processing has caused engineers to rethink 
this point in terms of the physics involved in generating 
speech. This leads to our next conclusion. 

The experimental results indicate that, in this 
case, there is no obvious relationship between the physics 
(pitch) of speech and the LPC mathematical representation of 
speech (reflection coefficients). 

This observation makes sense. Because reflection 
coefficient determination is based on probabilistic methods, 
error feedback, and random input samples, the resultant 
output of each lattice stage no longer resembles the 
original signal. Once the error signal passes through the 
first stage of the lattice network, its characteristics have 
been altered as much as 10 percent. Reflection coefficients 
are therefore a tool for determining predicted error 
calculations based on past inputs, and not a physical 
interpretation of the signal's content. 

Just as engineers are in error when they refer to the 
pattern that successive reflection coefficients present as 
its spectral envelope, reflection coefficients do not 
directly reflect the frequency content of the signal. 

3. Periodic/Pseudo-Periodic Differences 

Simulation and experimental results show that 
reflection coefficients work differently with periodic 
signals (sine wave) than with pseudo-periodic signals 
(speech). 

In calculating the reflection coefficients for a 
sine wave, the samples of one frequency are changed very 
slightly from the previous frequency's samples. Therefore 
the calculated reflection coefficients also change very 
slightly. This observation may be useful in the design of 
an LPC musical synthesizer, where frequency content and 
adjustment is processed in a controlled environment. 

On the other hand, speech behavior is more random 
than music. It is pseudo-periodic in the sense that 
complex vibrations are necessary to produce the speech 
waveform. However, the rate and randomness at which those 
vibrations change frequencies seem to prevent the 
reflection coefficients from having any kind of linear 
relationship with frequency content. 

It is therefore the conclusion of this research that 
the relationship between frequency content of speech and 
reflection coefficients is sufficiently complex that 
modifying reflection coefficients in order to transpose 
pitch will not be practical. 

B. RECOMMENDATIONS 

The conclusions have stated that there is no linear 
relationship present between frequency content and 
reflection coefficients. Recall that the motivation behind 
this research was based on Hall's research [Ref. 3] 
concerning pole shifting. Therefore the following actions 
are recommended if further or more extensive study is 
desired. 

1. Continue Hall's research using LPC as a tool for 
speech analysis/synthesis, but focusing attention on 
the shifting of poles and not on the adjustment of 
reflection coefficients. 

2. Use a data acquisition system that yields 12 or 16 
bit resolution of the speech samples. 

3. Build a larger data base containing speech 
utterances at different pitch levels and have the 
speakers be both male and female. 

4. Have the ability to match articulation patterns and 
synchronize points where speech utterances begin and 
end. 

5. Synthesize the input and processed speech to check 
for intelligibility of the utterances. 

6. Use more sophisticated techniques for pattern 
recognition. 

It is believed that the preceding recommendations, if 
followed, will help substantiate or refute Hall's research 
as well as the findings of this research. The need for an 
adequate technique for frequency transposition still exists. 

APPENDIX A 

REFLECTION COEFFICIENT DETERMINATION FOR A 
FREQUENCY VARIED SINEWAVE PROGRAM 
APPENDIX B 

REFLECTION COEFFICIENTS K1-K6 (NOISELESS) 

[Plot: reflection coefficients K1-K6 versus frequency (Hz); 10K samples/sec, noiseless.] 

APPENDIX C 

REFLECTION COEFFICIENTS K7-K12 (NOISELESS) 

[Plot: reflection coefficients K7-K12 versus frequency (Hz); 10K samples/sec, noiseless.] 

APPENDIX D 

REFLECTION COEFFICIENTS K1-K6 (S:N = 10:1) 

[Plot: reflection coefficients K1-K6 versus frequency (Hz).] 

APPENDIX E 

REFLECTION COEFFICIENTS K7-K12 (S:N = 10:1) 

[Plot: reflection coefficients K7-K12 versus frequency (Hz); 10K samples/sec, S:N = 10:1.] 



APPENDIX F 



DATA ACQUISITION SYSTEM 



A. INTRODUCTION 

There are a vast number of data acquisition systems on 
the market today. Even so, the system originally planned for 
the acquisition of this data broke down with no hope of 
timely repair. When all possible alternatives had been 
explored, it was decided that the only way to accomplish 
this portion of the research was to build a system capable 
of obtaining speech data samples. 

This section will discuss the system, hardware, and 
software utilities that were combined to produce the desired 
data samples. In an effort to provide the novice, as well 
as the expert, with the information needed to retrace these 
steps, anything worth documenting is documented. 
Additionally, a bibliography is provided in the main 
Bibliography of this thesis. 

B. EQUIPMENT REQUIREMENTS AND SETUP 

Figure 10 shows the experiment. Selected speech 
utterances were recorded on a 4-channel, 8-track tape 
recorder and stored for later use. The analog to digital 
(A/D) circuit was built and driving software written. This 
circuit was interfaced with the Zenith-100 microcomputer 
through the Prolog 7804-Z80A Processor Counter/Timer Card 
and the 8255 Programmable Peripheral Interface (PPI) 
microchip. 

Figure 10. Data Acquisition 3-Dimensional Flow Chart 

Once the data was captured in the Prolog's 32K buffer, 
it was uploaded to the Zenith-100, via ZMDS software, and 
stored in Intel-Hex data files. The files were transferred 
from the Zenith formatted disk, via an Osborne 
microcomputer, and placed on Kaypro formatted disks. 

A Kaypro 10 microcomputer converted the hexadecimal data 
into decimal data using Microsoft Basic (MBASIC) software. 
Edited versions of these files were then transferred to the 
IBM-3033 main frame computer for data processing. 

C. ANALOG TO DIGITAL CIRCUIT 



The chip that provides the analog to digital conversion 
is the AD-570. It provides 8-bit information at sampling 
rates up to 33K samples/second. For our purposes, the 
sampling rate was set at 10K since the majority of the 
frequency content is below 5 kHz. 

The circuit diagram [App. F.2] illustrates the 
interfacing between the 8255 PPI chip and the Host computer. 
The 8255 coordinates all of the necessary handshaking in 
driving the AD-570 chip. 

It was necessary to amplify the signal prior to entering 
the AD-570, to obtain full use of the 256 amplitudes 
available. It was also necessary to provide an adjustable 
DC-offset to assure a unipolar input (i.e. the middle value 
had to be adjusted to be level 128 instead of level 0). 

Also, the signal was filtered prior to data acquisition, 
through the use of a Butterworth filter designed with a 
frequency cutoff of 5 kHz. This helps smooth the data. 
However, during the processing of the data it may be 
necessary to filter it again. These additional circuits are 
also provided as Appendix F.2. 

D. MICROCOMPUTER INTERFACE 

The flow chart, provided as Appendix F.3, illustrates 
the Z-80 assembly language program (Appendix F.4) that was 
needed to drive the A/D circuit and collect the speech data. 
The program, A2D.ASM, was also useful in testing, step by 
step, the proper operation of the circuit. 

The Z-80A microprocessor is at the heart of the system, 
and the software designed to drive it is assembled using 
the Macro Assembler (M80) and linked to the Prolog station 
using Link software (L80). For more information on these 
procedures refer to the Bibliography. 

1. Sampling Rate 

The sampling rate is not arbitrary. It is a 
function of the software. In assembly language programming 
each step that the microprocessor goes through takes a 
specific amount of time. We will refer to a measure of time 
as a T-state. Each T-state equals the inverse of the clock 
rate interfaced with the Z-80 chip. Since we are using a 4 
MHz clock, one T-state equals 250 nanoseconds. 

Every command line in the assembly program, 
including the command 'No Operation' or NOP, requires 
several T-states to accomplish its task. We are interested 
in the interval of time it takes from one sample to the 
next, and then we modify the software accordingly. 

This program has a delay loop in it (labeled DELAY) 
to slow down the data acquisition to 10K samples/second. If 
it did not have the delay loop in it, it would easily sample 
at 23K samples/second. Since each utterance was limited to 
less than one second, 10K samples is workable and does not 
present prohibitive record lengths. 
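
The timing budget implied by these figures can be worked out directly. The Python sketch below is illustrative arithmetic only; the function names are assumptions, and the 23K samples/second figure is treated as the approximate free-running rate quoted above, not as a measured value from the assembly program.

```python
CLOCK_HZ = 4_000_000  # Z-80 clock rate stated in the text


def t_state_ns(clock_hz=CLOCK_HZ):
    """Duration of one T-state in nanoseconds (inverse of the clock rate)."""
    return 1e9 / clock_hz


def t_states_per_sample(sample_rate_hz, clock_hz=CLOCK_HZ):
    """Number of T-states available between successive samples."""
    return clock_hz / sample_rate_hz


def delay_t_states(target_rate_hz, free_running_rate_hz, clock_hz=CLOCK_HZ):
    """T-states the delay loop must consume to slow sampling to the target rate."""
    return (t_states_per_sample(target_rate_hz, clock_hz)
            - t_states_per_sample(free_running_rate_hz, clock_hz))
```

At 10K samples/second each sample period spans 400 T-states, so a loop that would otherwise run at roughly 23K samples/second needs a delay on the order of 226 T-states per sample.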

E. DATA FILE SETUP AND MANIPULATION 

Once the data is collected and stored in the Prolog's 
32K buffer, it is uploaded onto a Zenith 100 formatted 
floppy disk and stored in an appropriately titled HEX file. 
A sample of a typical segment of data is provided as 
Figure 11. 

:lC5Al0G0e07IE07E7l’7I797776797I7I7E7rEC£3I5 
‘ ;105Af 0e0e5£5a6e6£e£5£3S07I7F?ee07C727I7I93 
: 10530000508 ie0e0ei£0£17?ee7J 5 27E7E7C7E7I A3 
: 1 053 10007I7C7E7E7I73 £08080 71 E07FZ07L7E7 157 
:ie53220e777I7E7C7773777F£3£e7rS0a7£:£A£77A 
:10£E300eSee££f:£3 7I‘7C7E7E7D737S7I£e£3£iee6A 
: 10534000308 lc2£2e07I737C7C7E7C7E7 773 £05053 

Figure 11. Hexadecimal Data File Segment 

The file is in Intel-Hex format. A colon starts off 
each line. The following '10' is the hexadecimal byte count 
(16), which tells us that the line is full of data. The next 
four digits indicate the memory location in the buffer, and 
the double 0 that follows marks the line as a data record. 

After the double 0, every two hexadecimal digits represent 
one byte of information. There are 16 bytes of data, and 
then a checksum byte at the very end. For our purposes the 
first nine characters and the last two digits are of no use. 
The Intel-Hex file is already in ASCII format. 
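
The record layout described above can be decoded with a short routine. The Python sketch below is an illustration and is not the MBASIC program of Appendix F.5; the function name is an assumption, and the checksum test (all bytes of a record, including the checksum, sum to zero modulo 256) follows the general Intel-Hex convention.

```python
def parse_intel_hex_line(line):
    """Decode one Intel-Hex record into (address, record_type, data bytes)."""
    line = line.strip()
    if not line.startswith(":"):
        raise ValueError("record does not start with a colon")
    raw = bytes.fromhex(line[1:])
    if len(raw) < 5:
        raise ValueError("record is too short")
    count = raw[0]                      # byte count, e.g. 0x10 = 16 data bytes
    if len(raw) != count + 5:
        raise ValueError("byte count does not match record length")
    if sum(raw) & 0xFF != 0:
        raise ValueError("checksum mismatch")
    address = (raw[1] << 8) | raw[2]    # 16-bit load address in the buffer
    record_type = raw[3]                # 0 = data, 1 = end of file
    data = list(raw[4:4 + count])       # sample values as integers 0..255
    return address, record_type, data
```

Applied to each data line of a file, the routine returns the 16 integer sample values that the conversion step needs, with the address, record type, and checksum already stripped away.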

An Osborne microcomputer was used to transfer data from 
the Zenith 100 formatted floppy disk to a Kaypro formatted 
floppy disk. Since the data is needed in integer form to do 
the necessary processing, a program was written [App. F.5] 
in Microsoft Basic Language (MBASIC) to convert the data 
files from hexadecimal into the equivalent integer values. 

Finally, the data is ready for processing. Since the 
software was already written on the IBM-3033 to process and 
display the data, it was sent there via a 1200 baud modem 
and processed. 
APPENDIX F.2 

CIRCUIT DIAGRAMS FOR THE DATA ACQUISITION SYSTEM 

Figure 12. Pin Out of the 8255 Programmable Peripheral Interface 

Figure 13. Adjustable Gain and DC-Offset Circuit 

Figure 14. Low Pass, 2 Pole Butterworth Filter 
APPENDIX F.3 

SOFTWARE FLOW CHART FOR THE ASSEMBLY LANGUAGE 
DATA ACQUISITION PROGRAM 

THIS PROGRAM IS DESIGNED TO CAPTURE AND SAVE TO THE PROLOG'S MEMORY 
OUTPUT 8-BIT DATA FROM AN EXTERNAL CIRCUIT. TO TRANSFER THE PROLOG 
MEMORY TO THE HOST COMPUTER, FOR STORAGE ON A DISK FILE, USE ZMDS. 



APPENDIX F.4 

ASSEMBLY LANGUAGE PROGRAM 

This program is designed to acquire speech data. 
APPENDIX F.5 

INTEL-HEX TO DECIMAL DATA CONVERSION PROGRAM 

This program is designed to read a data file that is in 
Intel-Hex format and convert it to an integer file. 

2 PRINT "INPUT FILE ": INPUT FI$ 
4 PRINT "OUTPUT FILE ": INPUT FO$ 
20 OPEN "O",2,FO$ 
30 OPEN "I",1,FI$ 
40 INPUT #1, IN$ 
60 IF MID$(IN$,2,1)="0" THEN CLOSE:GOTO 140 ' stop when the line is not a full data record 
70 FOR I=10 TO 40 STEP 2 ' step through the 16 data bytes, two hex digits at a time 
80 HX$=MID$(IN$,I,2) 
90 V%=VAL("&H"+HX$) ' convert the hex pair to an integer 
95 IF I=40 THEN PRINT #2, USING "####";V% ELSE PRINT #2, USING "####";V%; 
98 IF I=40 THEN PRINT USING "####";V% ELSE PRINT USING "####";V%; 
100 NEXT I 
125 PRINT 
130 GOTO 40 
140 PRINT "DONE"+CHR$(7) 
APPENDIX G 

READY-E 

[Plot: sampled waveform; horizontal axis: time in sec.] 

This is an example of the sampled utterance 'Ready'. 



APPENDIX H 

SO WHAT-HIGH C 

[Plot: sampled waveform; horizontal axis: time in sec.] 

This is an example of the sampled utterance 'So What'. 




APPENDIX I 

SNEEZE-F 

[Plot: sampled waveform; horizontal axis: time in sec.] 

This is an example of the utterance 'Sneeze'. 



APPENDIX J 



SPEECH RESOLUTION SUMMARY 



This table lists the actual ranges used by each 
utterance out of a possible 256 levels (from 0 to 255). 

UTTERANCE    SCALE REFERENCE    RANGE 

READY        MIDDLE-C           60 - 220 
             D                  52 - 230 
             E                  10 - 255 
             F                   0 - 255 
             G                  35 - 255 
             A                  10 - 220 
             B                  25 - 250 
             HIGH-C             10 - 255 

SO WHAT      MIDDLE-C           60 - 175 
             D                  10 - 220 
             E                  15 - 
             F 
             G 
             A 
             B 
             HIGH-C 

SNEEZE       MIDDLE-C 
             D 
             E 
             F 
             G 
             A 
             B 
             HIGH-C 
APPENDIX K 

SPEECH REFLECTION COEFFICIENT PROGRAM 

This program is designed to determine the coefficients for a 
12th order lattice filter speech. It yields a new set of K's 
every 10 ms. 
APPENDIX L 

REFLECTION COEFFICIENT PATTERNS FOR 
SPEECH WAVE FORMS 

(Horizontal axis of each plot: time x 10 msec.) 

Figure L.1. Sneeze-E Pattern of Reflection Coefficient K1. 

Figure L.2. Sneeze-E Pattern of Reflection Coefficient K2. 

Figure L.3. Sneeze-E Pattern of Reflection Coefficient K3. 

Figure L.4. Sneeze-E Pattern of Reflection Coefficient K4. 

Figure L.5. Sneeze-E Pattern of Reflection Coefficient K5. 

Figure L.6. Sneeze-E Pattern of Reflection Coefficient K6. 
APPENDIX M 

TREND ANALYSIS RESULTS 

The following lists are the observations made on the 
reflection coefficient curves for each utterance. 

"READY" 

K1 - All pitches have relatively flat curves. The 
magnitudes vary slightly between .8 and [illegible]. The 
higher the pitch, the more defined the troughs are. 

K2 - These curves all had the unique feature of sloping 
upward. They generally ranged from -.4 to .9. No other 
correlation was noted. 

K3 - A negative sloping tendency characterised this set of 
curves. 

K4 - Each of these curves had a plateau. Ready-B, however, 
did not fit in with this set at all. 

K5 - These curves seemed to stay within a similar range, .3 
to -.7. Also several prominent peaks were uncorrelated. 

K6 - No correlations were noted; however, Ready-B was 
drastically different. 

"SNEEZE" 



K1 - Relatively flat curves. Ranges from .8 to 1.0. 

K2 - Highly uncorrelated curves. 

K3 - Also highly uncorrelated curves; however, more flat 
than K2. 

K4 - MC and D are similarly flat; the rest seem correlated 
with a valley to an elevated flat plateau. 

K5 - There seems to be a peak, then a declining trend in 
most of these curves. Again MC and D don't fit this 
observation and are generally flat. 

K6 - There are several peaks, then relatively flat curves. 

"SO WHAT" 

K1 - Similarly flat patterns. 

K2 - Highly uncorrelated with no recognizable patterns. 

K3 - There is a prominent valley in all of the observations 
except A. 

K4 - Highly uncorrelated with no recognizable patterns. 

K5 - Highly uncorrelated with no recognizable patterns. 

K6 - Highly uncorrelated with no recognizable patterns. 



APPENDIX N 



FAST FOURIER TRANSFORM PROGRAM 



This program determines if there are any discrete 
frequencies existing within the reflection coefficient 
patterns. 
APPENDIX O 

FREQUENCY CONTENT OF K(N) 

This is an example of the output from the FFT program to 
determine if there are any discrete frequencies present. 

Figure O.1. Reflection Coefficient K6 for Utterance 'Ready-MC'. 

Figure O.2. Reflection Coefficient K3 for Utterance 'Ready-MC'. 

Figure O.3. Reflection Coefficient K4 for Utterance 'Sneeze-MC'. 

Figure O.4. Reflection Coefficient K1 for Utterance 'Sneeze-MC'. 

Figure O.5. Reflection Coefficient K6 for Utterance 'So What-MC'. 

Figure O.6. Reflection Coefficient K4 for Utterance 'So What-MC'. 